Data Quality in Data Warehouses
Abstract
Fayyad and Uthurusamy (2002) stated that most of the work in creating a data warehouse, representing months or years of effort, lies in cleaning up duplicates and resolving other anomalies. This article gives an overview of two methods for improving data quality. The first is data cleaning, which finds duplicates within a file or across files. The second is edit/imputation, which enforces business rules and fills in missing data. The fastest data-cleaning methods are suitable for files with hundreds of millions of records (Winkler, 1999b, 2003b). The fastest edit/imputation methods are suitable for files with millions of records (Winkler, 1999a, 2004b).
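The two methods named in the abstract can be illustrated with a toy sketch. This is not the method of the cited papers: it assumes a simple blocking key (ZIP code), a generic string similarity from Python's standard library, and a single range rule on age with median fill-in. The field names, the 0.85 threshold, and the 0 to 120 age range are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from statistics import median_low


def find_duplicates(records, threshold=0.85):
    """Return id pairs whose names look alike within the same block."""
    # Blocking (assumed key: ZIP code): only records sharing a key are
    # compared, which keeps comparisons far below the full n*(n-1)/2.
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["zip"]].append(rec)

    pairs = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            score = SequenceMatcher(
                None, a["name"].lower(), b["name"].lower()
            ).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


def edit_and_impute(records, low=0, high=120):
    """Apply a range edit to age, then fill failed or missing values."""
    # Edit rule (hypothetical): age must lie in [low, high].
    valid = [
        r["age"] for r in records
        if r["age"] is not None and low <= r["age"] <= high
    ]
    # Imputation (hypothetical): low median of the values that passed.
    fill = median_low(valid) if valid else None

    cleaned = []
    for r in records:
        age = r["age"]
        if age is None or not low <= age <= high:
            age = fill
        cleaned.append({**r, "age": age})
    return cleaned


records = [
    {"id": 1, "name": "John Smith", "zip": "20233", "age": 34},
    {"id": 2, "name": "Jon Smith", "zip": "20233", "age": None},
    {"id": 3, "name": "Mary Jones", "zip": "20233", "age": 290},
    {"id": 4, "name": "John Smith", "zip": "94110", "age": 51},
]

duplicate_pairs = find_duplicates(records)
cleaned_records = edit_and_impute(records)
```

In this sketch, records 1 and 2 are flagged as a candidate duplicate pair, while record 4 is never compared with them because it falls in a different block. That trade-off between speed and missed matches is the reason blocking keys need care on large files.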
Similar resources
A Framework for Data Cleaning in Data Warehouses
Achieving high data quality in data warehouses is a persistent challenge, and data cleaning is crucial to meeting it. A set of methods and tools has been developed for this purpose. However, at least two questions still need to be answered: How to improve the efficiency while performing data cleaning? How to improve the degree of automation when perfor...
Data Quality Management in Web Warehouses using BPM
The increasing amount of data published on the Web poses the new challenge of making possible the exploitation of these data by different kinds of users and organizations. Additionally, the quality of published data is highly heterogeneous and the worst problem is that it is unknown for the data consumer. In this context, we consider Web Warehouses (WW) (Data Warehouses populated by web data so...
Design and Analysis of Quality Information for Data Warehouses
Data warehouses are complex systems that have to deliver highly aggregated, high-quality data from heterogeneous sources to decision makers. Because requirements and the environment change dynamically, data warehouse systems rely on meta databases to control their operation and to aid their evolution. In this paper, we present an approach to assess the quality of the data warehouse via a s...
Using Time Series to Assess Data Quality in Telecommunications Data Warehouses
The growing complexity of telephone services, particularly in mobile telephony, and its impact upon billing data mean that phone call volume modelling techniques have become crucial in assessing the accuracy of the information available in telecommunications data warehouses. Time series modelling, normally used for forecasting, provides a suitable tool for this purpose as well. Preliminary ex...
Methodological Guidelines and Adaptive Statistical Data Validation to Build Effective Data Warehouses
Over time, data integration involving data warehouses is becoming more difficult to develop and to manage due to the growing heterogeneity of data sources. Despite the significant advances in research and technologies, many integration projects are still too slow to generate pragmatic results and are often abandoned before that. The objective of this work is the specification of a developing st...
Repository Support for Data Warehouse Evolution
Data warehouses are complex systems consisting of many components which store highly aggregated data for decision support. Due to the role of data warehouses in the daily business work of an enterprise, the requirements for their design and implementation are dynamic and subjective. Therefore, data warehouse design is a continuous process which has to reflect the changing environment of a ...